4 research outputs found
An automatic diacritization algorithm for undiacritized Arabic text
Modern Standard Arabic (MSA) is used today in most written and some spoken media. It is, however, not the native dialect of any country. Recently, the volume of written dialectal Arabic text has increased dramatically. Most of these texts are written in the Egyptian dialect, as it is considered the most widely used and understood dialect throughout the Middle East. As in other Semitic languages, short vowels in written Arabic are not written as letters but are represented by diacritic marks. Nonetheless, these marks are omitted in most modern Arabic texts (for example, books and newspapers). The absence of diacritic marks creates considerable ambiguity, as an undiacritized word may correspond to more than one correct diacritization (vowelization) form. Hence, the aim of this research is to reduce the ambiguity caused by the absence of diacritic marks using a hybrid algorithm with significantly higher accuracy than state-of-the-art systems for MSA. Moreover, this research implements the algorithm for dialectal Arabic text and evaluates its accuracy. The design of the proposed algorithm is based on two main techniques: statistical n-gram models with maximum likelihood estimation, and a morphological analyzer. The proposition of this research is to merge the word, morpheme, and letter levels, together with their sub-models, into one platform in order to improve automatic diacritization accuracy. Moreover, ignoring case-ending diacritization, that is, the diacritic mark on the last letter of the word, yields a significant reduction in error. The reason for this remarkable improvement is that the Arabic language prohibits adding diacritic marks over some letters. The hybrid algorithm demonstrated good performance: 97.9% when applied to an MSA corpus (Tashkeela), 97.1% when applied to LDC’s Arabic Treebank-Part 3 v1.0, and 91.8% when applied to an Egyptian dialectal corpus (CallHome). The main contribution of this research is the hybrid algorithm for automatic diacritization of undiacritized MSA text and dialectal Arabic text. The proposed algorithm was applied to and evaluated on the Egyptian colloquial dialect, the most widely understood and used dialect throughout the Arab world, which, based on the literature review, is considered a first.
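As a rough illustration of the word-level maximum likelihood component mentioned in this abstract, the following minimal sketch picks, for each undiacritized word, the diacritized form seen most often in training data. It is not the authors' implementation: the function names (strip_diacritics, train_word_mle, diacritize) are hypothetical, the model is a plain unigram lookup with no morphological analyzer, and the diacritic set is assumed to be the Unicode range U+064B to U+0652.

```python
import re
from collections import Counter, defaultdict

# Assumption: diacritics are the Unicode marks fathatan (U+064B) .. sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Remove all diacritic marks, leaving the bare letters."""
    return DIACRITICS.sub("", text)


def train_word_mle(diacritized_sentences):
    """Count how often each diacritized form occurs for each bare word."""
    counts = defaultdict(Counter)
    for sentence in diacritized_sentences:
        for word in sentence.split():
            counts[strip_diacritics(word)][word] += 1
    return counts


def diacritize(text, counts):
    """Replace each bare word with its most frequent diacritized form.

    Words never seen in training are left undiacritized.
    """
    out = []
    for word in text.split():
        forms = counts.get(word)
        out.append(forms.most_common(1)[0][0] if forms else word)
    return " ".join(out)
```

A model would be built with train_word_mle over a diacritized corpus and applied with diacritize to bare text. A hybrid system of the kind described would additionally need a fallback for unseen words, for example at the morpheme or letter level.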
3D objects and scenes classification, recognition, segmentation, and reconstruction using 3D point cloud data: A review
Three-dimensional (3D) point cloud analysis has become an attractive
subject in realistic imaging and machine vision due to its simplicity,
flexibility, and powerful visualization capacity. Indeed, the
representation of scenes and buildings using 3D shapes and formats has enabled
many applications, including autonomous driving and scene and object
reconstruction. Nevertheless, working with this emerging type of data has
been a challenging task for object representation, scene recognition,
segmentation, and reconstruction. In this regard, a significant effort has
recently been devoted to developing novel strategies, using different
techniques such as deep learning models. To that end, we present in this paper
a comprehensive review of existing tasks on 3D point clouds: a well-defined
taxonomy of existing techniques is established based on the nature of the adopted
algorithms, application scenarios, and main objectives. Various tasks performed
on 3D point cloud data are investigated, including object and scene
detection, recognition, segmentation, and reconstruction. In addition, we
introduce a list of the datasets used, discuss the respective evaluation metrics, and
compare the performance of existing solutions to better inform the
state of the art and identify their limitations and strengths. Lastly, we
elaborate on current challenges facing the field and on future
trends attracting considerable interest, which could be a starting point for
upcoming research studies.
Crosslingual automatic diacritization for Egyptian Colloquial Dialect
In this paper, the problem of missing diacritic marks in most dialectal Arabic written resources is addressed. Our aim is to implement a scalable and extensible platform for automatically restoring the diacritic marks of undiacritized dialectal Arabic texts. Different rule-based and statistical techniques are proposed. These include morphological analyzer-based, maximum likelihood estimate, and statistical n-gram models. The proposed platform includes helper tools for text preprocessing and encoding conversion. The diacritization accuracy of each technique is evaluated in terms of Diacritic Error Rate (DER) and Word Error Rate (WER). The approach trains several n-gram models on different lexical units. A data pool combining Modern Standard Arabic (MSA) data and dialectal Arabic data was used to train the models. 2016 IEEE.
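To make the idea of training n-gram models on different lexical units more concrete, here is a minimal sketch of a letter-level bigram model: for each letter, conditioned on the previous letter, it selects the diacritic marks seen most often in training. This is an illustrative assumption, not the platform described in the paper. The names letters_with_marks, train_letter_bigram, and diacritize_word are hypothetical, as is the word-start symbol "^".

```python
from collections import Counter, defaultdict

# Assumption: diacritics are the Unicode marks U+064B .. U+0652.
MARKS = {chr(c) for c in range(0x064B, 0x0653)}


def letters_with_marks(word):
    """Split a diacritized word into (letter, marks) pairs."""
    pairs = []
    for ch in word:
        if ch in MARKS and pairs:
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def train_letter_bigram(diacritized_words):
    """Count marks for each (previous letter, current letter) context."""
    counts = defaultdict(Counter)
    for word in diacritized_words:
        prev = "^"  # hypothetical word-start symbol
        for letter, marks in letters_with_marks(word):
            counts[(prev, letter)][marks] += 1
            prev = letter
    return counts


def diacritize_word(bare_word, counts):
    """Attach the most frequent marks to each letter; unseen contexts get none."""
    out, prev = [], "^"
    for letter in bare_word:
        seen = counts.get((prev, letter))
        out.append(letter + (seen.most_common(1)[0][0] if seen else ""))
        prev = letter
    return "".join(out)
```

A letter-level model like this can produce output for words that a word-level model has never seen, which is one plausible reason to combine models over several lexical units.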
Automatic diacritics restoration for modern standard Arabic text
In this paper, the problem of missing diacritic marks in most Arabic written resources is investigated. Our aim is to implement a scalable and extensible platform to automatically restore missing diacritic marks for Modern Standard Arabic text. Different rule-based and statistical techniques are proposed. These include morphological analyzer-based, maximum likelihood estimate, and statistical n-gram models. The diacritization accuracy of each technique was evaluated based on Diacritic Error Rate (DER) and Word Error Rate (WER). The proposed platform includes helper tools for text preprocessing and encoding conversion. It yielded a WER of 7.1% and a DER of 3.9%. When the case ending was ignored, the platform yielded a WER of 5.1% and a DER of 2.7%. 2016 IEEE.
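The two metrics reported above can be sketched as follows, under common definitions that are assumed here rather than taken from the paper: DER is the fraction of letters whose diacritic marks differ from the reference, and WER is the fraction of words containing at least one such letter. The ignore_case_ending flag drops the last letter of every word before comparing. The function names are hypothetical, and the sketch assumes the reference and hypothesis contain the same words and letters in the same order.

```python
# Assumption: diacritics are the Unicode marks U+064B .. U+0652.
MARKS = {chr(c) for c in range(0x064B, 0x0653)}


def split_marks(word):
    """Split a diacritized word into (letter, marks) pairs."""
    pairs = []
    for ch in word:
        if ch in MARKS and pairs:
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def error_rates(reference, hypothesis, ignore_case_ending=False):
    """Return (DER, WER) for two aligned, diacritized strings."""
    letters = letter_errors = words = word_errors = 0
    for ref_word, hyp_word in zip(reference.split(), hypothesis.split()):
        ref, hyp = split_marks(ref_word), split_marks(hyp_word)
        if ignore_case_ending:
            # Skip the final letter, whose mark carries the case ending.
            ref, hyp = ref[:-1], hyp[:-1]
        wrong = sum(r[1] != h[1] for r, h in zip(ref, hyp))
        letters += len(ref)
        letter_errors += wrong
        words += 1
        word_errors += 1 if wrong else 0
    der = letter_errors / letters if letters else 0.0
    wer = word_errors / words if words else 0.0
    return der, wer
```

Calling error_rates twice on the same pair of strings, once with ignore_case_ending=True, shows how much of the measured error is concentrated on word-final marks.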